Comparing Predictive Model Performance with Confidence Curves (2022-EU-45MP-1039)

Bryan Fricke, Senior Product Manager, JMP
Russ Wolfinger, Director, Research and Development, JMP

Repeated k-fold cross-validation is commonly used to evaluate the performance of predictive models. The problem is, how do you know when a difference in performance is sufficiently large to declare one model better than another? Typically, null hypothesis significance testing (NHST) is used to determine if the differences between predictive models are “significant”, although the usefulness of NHST has been debated extensively in the statistics literature in recent years. In this paper, we discuss problems associated with NHST and present an alternative known as confidence curves, which has been developed as a new JMP Add-In that operates directly on the results generated from JMP Pro's Model Screening platform.

Hello, my name is Bryan Fricke.

I'm a newly minted Product Manager at JMP, focusing on the JMP user experience.

Previously, I was a software developer working on exporting reports

to stand-alone HTML files, JMP Live, and JMP Public.

In this presentation,

I'm going to talk about using confidence curves

as an alternative to Null Hypothesis Significance Testing

in the context of predictive model screening.

Additional material on this subject can be found on the JMP Community website

and the paper associated with this presentation.

Dr. Russ Wolfinger, who is a distinguished research fellow at JMP, is a co-author,

and I would like to thank him for his contributions.

The Model Screening platform, introduced in JMP Pro 16,

allows you to evaluate the performance

of multiple predictive models using cross-validation.

To show you how the Model Screening platform works,

I'm going to use the Diabetes Data table,

which is available in the JMP Sample Data Library.

I'll choose Model Screening from the Analyze > Predictive Modeling menu.

JMP responds by displaying the Model Screening dialogue.

The first three columns in the data table

represent disease progression in continuous, binary, and ordinal forms.

I'll use the continuous column named Y as the response variable.

I'll use all the columns from Age to Glucose in the X Factor role.

I'll type 1234 in the Set Random Seed input box for reproducibility.

I'll select the check box next to K Fold Cross validation

and leave K set to five.

I'll type three into the input box next to Repeated K Fold.

In the Method list, I'll unselect Neural,

and now I'll select OK.

JMP responds by training and validating models for each of the selected Methods

using their default parameter settings and cross validation.

After completing the training and validating process,

JMP displays the results in a new window.

For each modeling method, the Model Screening platform provides

performance measures in the form of point estimates

for the coefficient of determination,

which is also known as R squared, the root average squared error,

and the standard deviation for the root average squared error.

Now I'll click Select Dominant.

JMP responds by highlighting the method that performs best

across the performance measures.

What's missing here is a graphic

to show you the size of the difference between the Dominant Method

and the other methods, along with the visualization

of the uncertainty associated with the differences.

But why not just show P values indicating whether the differences are significant?

Shouldn't a decision about whether one model is superior to another

be based on significance?

First,

since a P value provides a probability based on a standardized difference,

a P value by itself loses information about the raw difference,

so a significant difference doesn't imply a meaningful difference.

But is that really a problem?

I mean, isn't it pointless to be concerned

with the size of a difference between two models before significance testing is used

to determine whether the difference is real?

The problem with that line of thinking is that it's power, or one minus beta,

that determines our ability to correctly reject a null hypothesis.

Authors such as Jacob Cohen and Frank Schmidt

have suggested that typical studies have a power to detect differences

in the range of 0.4 to 0.6.

Let's suppose we have a difference where the power to detect a true difference

is 0.5 at an alpha value of 0.05.

That suggests we would detect the true difference

on average 50 percent of the time.

In that case, significance testing would identify real differences

no better than flipping an unbiased coin.
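
To make the power figure concrete, here is a minimal sketch of the arithmetic. It assumes a two-sided z-test with a known standard error, which is a simplification chosen for illustration and not the test used by the add-in. The function name `two_sided_power` is hypothetical. Under that assumption, power is about 0.5 when the true difference sits right at the critical value.

```python
from statistics import NormalDist

def two_sided_power(effect_in_se: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test (assumed for illustration) when the true
    difference lies `effect_in_se` standard errors away from the null."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls in either rejection region.
    return (1 - z.cdf(crit - effect_in_se)) + z.cdf(-crit - effect_in_se)

# A true difference exactly as large as the critical value (about 1.96 SE).
crit = NormalDist().inv_cdf(0.975)
power_at_crit = two_sided_power(crit)
print(f"power when the true difference equals the critical value: {power_at_crit:.3f}")
```

With no true difference at all, the same function returns alpha, which is the Type One error rate.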

If all other things are equal, Type One and Type Two errors are equivalent.

But significance tests that use an Alpha value of .05

often implicitly assume Type Two errors are preferable to Type One errors,

particularly if the power is as low as 0.5.

A common suggestion to address these and other issues with significance testing

is to show the point estimate along with confidence intervals.

One objection to doing so is that a point estimate

along with a 95 percent confidence interval

is effectively the same thing as significance testing.

Even if we assume that is true, a point estimate and confidence interval

still puts the magnitude of the difference and the range of the uncertainty

front and center,

whereas a lone P value conceals them both.

Various authors, including Cohen and Schmidt,

have recommended replacing significance testing

with point estimates and confidence intervals.

Even so, the recommendation to use confidence intervals

raises the question, "Which ones do we show?"

Showing only the 95 percent confidence interval

would likely encourage you to interpret it as another form of significance testing.

The solution provided by confidence curves

is to literally show all confidence intervals

up to an arbitrarily high confidence level.
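
As a rough illustration of "all confidence intervals," the sketch below computes nested intervals for one difference across several confidence levels. The mean difference, the standard error, and the list of levels are made-up values, and the normal approximation is an assumption made to keep the example short. The add-in itself uses a corrected t-test, as described later in this talk.

```python
from statistics import NormalDist

def interval(mean_diff: float, std_err: float, confidence: float):
    """Two-sided interval at the given confidence level (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (mean_diff - z * std_err, mean_diff + z * std_err)

# Hypothetical inputs: a mean difference in R squared and its standard error.
mean_diff, std_err = -0.02, 0.015
levels = [0.0, 0.5, 0.8, 0.95, 0.99, 0.9999]

# One (confidence level, lower bound, upper bound) row per level.
curve = [(c, *interval(mean_diff, std_err, c)) for c in levels]
for c, lo, hi in curve:
    print(f"{c:7.2%}  [{lo:+.4f}, {hi:+.4f}]")
```

At the zero percent level the interval collapses to the point estimate, and each higher level contains all the lower ones.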

So how do you show confidence curves in JMP?

To conveniently create confidence curves in JMP,

install the Confidence Curves add-in by visiting the JMP Community homepage.

Type "confidence curves" into the search input field.

Click on the first entry that appears.

Now click the download icon next to Confidence Curves.jmpaddin,

and now you can click on the downloaded file.

JMP responds by asking if I want to install the add-in.

You would click "Install."

However, I'll click "Cancel," as I already have the add-in installed.

How do you use the add-in?

First, to generate confidence curves for this report,

select "Save Results Table"

from the top red triangle menu located on the Model Screening report window.

JMP responds by creating a new table containing, among others,

the following columns:

Trial, which contains the identifiers for three sets of cross validation results;

Fold, which contains the identifiers for the five distinct sets of subsamples

used for validation in each trial;

Method, which contains the methods used to create models from the test data,

and N, which contains the number of data points used in the validation folds.

Note that if the number of repeats is exactly one,

the Trial column is neither created nor needed.

Save for that exception,

these columns are essential for the Confidence Curves add-in

to function properly.

In addition to these columns,

you need one column that provides the metric to compare methods.

I'll be using R squared as the metric of interest in this presentation.
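
To show how those columns fit together, here is a small sketch that pairs each method with a baseline within every Trial and Fold and takes the difference in the metric. The rows, their values, and the "RSquare" column name are assumptions for illustration, and `paired_differences` is a hypothetical helper, not part of the add-in.

```python
from collections import defaultdict

# Hypothetical rows shaped like the saved results table; the values and the
# "RSquare" column name are assumptions for illustration only.
rows = [
    {"Trial": 1, "Fold": 1, "Method": "Fit Stepwise", "N": 88, "RSquare": 0.50},
    {"Trial": 1, "Fold": 1, "Method": "Decision Tree", "N": 88, "RSquare": 0.36},
    {"Trial": 1, "Fold": 2, "Method": "Fit Stepwise", "N": 88, "RSquare": 0.48},
    {"Trial": 1, "Fold": 2, "Method": "Decision Tree", "N": 88, "RSquare": 0.35},
]

def paired_differences(rows, baseline, metric):
    """Return {method: [metric minus baseline metric, one per (Trial, Fold)]}."""
    by_split = defaultdict(dict)
    for row in rows:
        # A table from a single repeat has no Trial column, so default to 1.
        key = (row.get("Trial", 1), row["Fold"])
        by_split[key][row["Method"]] = row[metric]
    diffs = defaultdict(list)
    for key in sorted(by_split):
        scores = by_split[key]
        for method, value in scores.items():
            if method != baseline:
                diffs[method].append(value - scores[baseline])
    return dict(diffs)

diffs = paired_differences(rows, baseline="Fit Stepwise", metric="RSquare")
print(diffs)
```

Pairing within the same Trial and Fold keeps each comparison on the same validation subsample, which is why those identifier columns matter.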

Once you have the Model Screening results table,

click "Add-Ins" on JMP's main menu bar and then select "Confidence Curves."

The logic that follows would be better placed in a wizard,

and I hope to add that functionality in a future release of the add-in.

As it is, the first dialogue that appears

requests you to select the name of the table that was generated

when you chose "Save Results Table"

from the Model Screening report's red triangle menu.

The name of the table in this case is Model Screening Statistics Validation Set.

Next, a dialog is displayed that requests the name of the method

that will serve as the baseline

from which all the other performance metrics are measured.

I suggest starting with the method

that was highlighted when you clicked Select Dominant

in the Model Screening report window,

which in this case is Fit Stepwise.

Finally, a dialogue is displayed

that requests you to select the metric to be compared between the various methods.

As mentioned earlier, I'll use R squared as the metric for comparison.

JMP responds by creating a confidence curve table that contains P values

and corresponding confidence levels

for the mean metric difference between the chosen baseline method

and each of the other methods.

More specifically, the generated table has columns for the following:

Model, in which each row contains the name of the modeling method

whose performance is evaluated relative to the baseline method;

P value,

in which each row contains the probability associated with the performance difference

at least as extreme as the values shown in the Difference in R Square column;

Confidence Interval, in which each row contains the confidence level we have

that the true mean is contained in the associated interval.

And finally, Difference in R Square, in which each row is the maximum or minimum

of the expected difference in R squared associated with the confidence level

shown in the Confidence Interval column.

From this table, confidence curves are created

and shown in the Graph Builder graph.

What are confidence curves?

To clarify the key attributes of a confidence curve,

I'll hide all but the Support Vector Machines confidence curve

using the local data filter by clicking on Support Vector Machines.

By default, a confidence curve only shows the lines

that connect the extremes of each confidence interval.

To see the points, select Show Control Panel from the red triangle menu

located next to the text that reads Graph Builder in the title bar.

Now I'll Shift-click the Points icon.

JMP responds by displaying the end points of the confidence intervals

that make up the confidence curve.

Now I will zoom in and examine a point.

If you hover the mouse pointer over any of these points,

a hover label shows the P value, confidence interval,

difference in the size of the metric,

and the method used to generate the model being compared to the reference.

Now I will turn off the points by Shift-clicking the Points icon

and clicking the Done button.

Even though the individual points are no longer shown,

you can still view the associated hover labels

by placing the mouse pointer over the confidence curve.

The point estimate for the mean difference in performance

between Support Vector Machines and Fit Stepwise

is shown at the 0 percent confidence level,

which is the mean value of the differences computed using cross validation.

A confidence curve plots the extent of each confidence interval

from the generated table between the zero and 99.99 percent confidence levels,

which is an arbitrarily high value.

Along the left Y axis,

P values associated with the confidence intervals are shown.

Along the right Y axis,

the confidence level associated with each confidence interval is shown.

The Y axis uses a log scale,

so that more resolution is shown at higher confidence levels.

By default, two reference lines are plotted alongside a confidence curve.

The vertical line represents the traditional null hypothesis

of no difference in effect.

Note you can change the vertical line position

and thereby the implicit null hypothesis

in the X axis settings.

The horizontal line

passes through the conventional 95 percent confidence interval.

As with the vertical reference line,

you can change the horizontal line position

and thereby the implicit level of significance,

by changing the Y axis settings.

If a confidence curve crosses the vertical line above the horizontal line,

you cannot reject the null hypothesis using significance testing.

For example, we cannot reject the null hypothesis for support vector machines.

On the other hand,

if a confidence curve crosses the vertical line below the horizontal line,

you can reject the null hypothesis.

For example, we can reject the null hypothesis for Boosted Tree.

How are confidence curves computed?

The current implementation of confidence curves

assumes the differences are computed using R times repeated

K Fold cross validation.

The extent of each confidence interval is computed

using what is known as a Variance Corrected Resampled T-Test.

Authors Claude Nadeau and Yoshua Bengio

noted that a corrected resampled T-test

is typically used in cases where training sets

are five or ten times larger than validation sets.

For more details, please see the paper associated with this presentation.
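
For readers who want to see the arithmetic, the sketch below follows the variance correction proposed by Nadeau and Bengio: the sample variance of the differences is scaled by (1/J + n_test/n_train), where J is the number of repeats times the number of folds. The talk does not state whether the add-in uses exactly this form, so treat it as an assumption. The differences, the training and validation sizes, and the function names are hypothetical. The P value is obtained by integrating the Student's t density with Simpson's rule so that only the standard library is needed.

```python
import math
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Return (t statistic, corrected standard error, degrees of freedom)."""
    j = len(diffs)  # repeats times folds
    corrected_var = (1.0 / j + n_test / n_train) * variance(diffs)
    std_err = math.sqrt(corrected_var)
    return mean(diffs) / std_err, std_err, j - 1

def t_two_sided_p(t, df, steps=2000):
    """Two-sided P value from Student's t, using Simpson's rule on the density."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)

# Hypothetical differences in R squared from 3 repeats of 5-fold cross validation.
example_diffs = [-0.15, -0.12, -0.16, -0.13, -0.14,
                 -0.11, -0.17, -0.14, -0.15, -0.12,
                 -0.13, -0.16, -0.14, -0.15, -0.13]
t_stat, std_err, df = corrected_resampled_t(example_diffs, n_train=352, n_test=88)
p_value = t_two_sided_p(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df}, p = {p_value:.6f}")
```

The extra n_test/n_train term widens the interval to account for the overlap between training sets across folds, which an uncorrected t-test ignores.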

So how are confidence curves interpreted?

First, a confidence curve graphically depicts

the mean difference in the metric of interest between a given method

and a reference method at the 0 percent confidence level.

We can evaluate whether the mean difference between methods is meaningful.

If the mean difference isn't meaningful, there's little point in further analysis

of the given method versus the reference method with respect to the chosen metric.

What constitutes a meaningful difference depends on the metric of interest

as well as the intended scientific or engineering application.

For example, you can see the model developed with a decision tree method

is on average about 14 percent worse than Fit Stepwise,

which arguably is a meaningful difference.

If the difference is meaningful,

we can evaluate how precisely the difference has been measured

by evaluating the width of the associated confidence intervals.

For any confidence interval not crossing the default vertical reference line,

we have at least that level of confidence that the mean difference is non-zero.

For example, the decision tree confidence curve doesn't cross the vertical reference line

until about the 99.98 percent confidence level,

so we are nearly 99.98 percent confident the mean difference isn't equal to zero.

In fact, with this data set,

it turns out that we can be about 81 percent confident

that Fit Stepwise is at least as good, if not better,

than every method other than generalized regression lasso.

Now let's consider the relationship between confidence curves.

If two or more confidence curves significantly overlap

and the mean difference of each is not meaningfully different from the other,

the data suggest each method performs

about the same as the other with respect to the reference model.

For example, we can see that on average,

the Support Vector Machines model performs

less than 0.5 percent better than Bootstrap Forest,

which is arguably not a meaningful difference.

And the confidence intervals

do not overlap until about the four percent confidence level,

which suggests these values would be expected

if both methods really do have about the same difference in performance

with respect to the reference.

If the average difference in performance

is about the same for two confidence curves

but the confidence intervals don't overlap too much,

the data suggests the models perform about the same as each other

with respect to the reference model.

However, we are confident of a non-meaningful difference.

This particular case is rarer than the others, and

I don't have an example to show with this data set.

On the other hand,

if the average difference in performance between a pair of confidence curves

is meaningfully different and the confidence curves have little overlap,

the data suggests the models perform differently from one another

with respect to the reference.

For example, the generalized regression lasso model

predicts about 13.8 percent more of the variation in the response

than does the decision tree model.

Moreover, the confidence curves don't overlap

until about the 99.9 percent confidence level,

which suggests these results are quite unusual

if the methods actually perform about the same with respect to the reference.

Finally, if the average difference in performance

between a pair of confidence curves

is meaningfully different and the curves have considerable overlap,

the data suggests that while the methods perform differently from one another

with respect to the reference,

it wouldn't be surprising if the differences are spurious.

For example, we can see that on average,

Support Vector Machines predicted about 1.4 percent more of the variance

in the response than did K Nearest Neighbors.

However, the confidence intervals begin to overlap

at about the 17 percent confidence level,

which suggests it wouldn't be surprising if the difference in performance

between each method and the reference

is actually smaller than suggested by the point estimates.

Simultaneously,

it wouldn't be surprising if the actual difference is larger than measured,

or if the direction of the difference is actually reversed.

In other words, the difference in performance is uncertain.

Note that it isn't possible to assess the variability in performance

between two models relative to one another

when the differences are relative to a third model.

To compare the variability in performance between two methods

relative to one another,

one of the two methods must be the reference method

from which the differences are measured.
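
A small numeric sketch may help show why. In the made-up data below, method B has the same set of differences from the reference in both scenarios, so its curve against the reference would look the same either way. Only the fold-by-fold pairing with method A changes, and the spread of the direct A minus B difference changes with it. All values and variable names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical per-fold differences in R squared versus a shared reference.
a_vs_ref = [0.01, 0.03, 0.02, 0.04, 0.00]
b_same_order = [0.00, 0.02, 0.01, 0.03, -0.01]  # rises and falls with A
b_shuffled = [0.03, -0.01, 0.02, 0.00, 0.01]    # same values, different folds

# The reference cancels out, so A minus B per fold is the direct comparison.
direct_same = [a - b for a, b in zip(a_vs_ref, b_same_order)]
direct_shuffled = [a - b for a, b in zip(a_vs_ref, b_shuffled)]

print(f"mean A-B: {mean(direct_same):.3f} vs {mean(direct_shuffled):.3f}")
print(f"sd   A-B: {stdev(direct_same):.3f} vs {stdev(direct_shuffled):.3f}")
```

The mean difference between A and B is the same in both scenarios, but its variability is not, and that variability cannot be read off two curves drawn against a third method.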

But what about multiple comparisons?

Don't we need to adjust the P values

to control the familywise Type One error rate?

In his paper about confidence curves, Daniel Berrar suggests

that adjustments are needed in confirmatory studies

where a goal is prespecified, but not in exploratory studies.

This suggests using unadjusted P values for multiple confidence curves

in an exploratory fashion

and only a single confidence curve generated from different data

to confirm your finding of a significant difference between two methods

when using significance testing.

That said, please keep in mind the dangers of cherry picking

and P hacking when conducting exploratory studies.

In summary, the Model Screening platform introduced in JMP Pro 16

provides a means to simultaneously compare the performance

of predictive models created using different methodologies.

JMP has a long-standing goal to provide a graph with every statistic,

and confidence curves help to fill that gap

for the Model Screening platform.

You might naturally expect to use significance testing to differentiate

between the performance of the various methods being compared.

However, P values have come under increased scrutiny in recent years

for obscuring the size of performance differences.

In addition,

P values are often misinterpreted as the probability the null hypothesis is false.

Instead, a P value is the probability of observing a difference as or more extreme,

assuming the null hypothesis is true.

The probability of correctly rejecting the null hypothesis

when it is false is determined by power, or one minus beta.

I've argued that it is not uncommon to only have a 50 percent chance

of correctly rejecting the null hypothesis with an Alpha value of 0.05 .

As an alternative, a confidence interval could be shown instead of a lone P value.

However, the question would be left open as to which confidence level to show.

Confidence curves address these concerns

by showing all confidence intervals up to an arbitrarily high level of confidence.

The mean difference in performance

is clearly visible at the zero percent confidence level,

and that acts as a point estimate.

All other things being equal,

Type One and Type Two errors are equivalent,

so confidence curves don't embed a bias towards trading Type One errors

for Type Two.

Even so, by default, a vertical line is shown in the confidence curve graph

for the standard null hypothesis of no difference.

In addition, a horizontal line is shown that delineates

the 95 percent confidence interval,

which readily affords a typical significance testing analysis if desired.

The defaults for these lines are easily modified

if a different null hypothesis or confidence level is desired.

Even so, given the rather broad and sometimes emphatic suggestion

to replace significance testing with point estimates and confidence intervals,

it may be best to view a confidence curve as a point estimate

along with a nearly comprehensive view of its associated uncertainty.

If you have feedback about the Confidence Curves add-in,

please leave a comment on the JMP Community site

and don't forget to vote for this presentation

if you found it interesting and/or useful.

Thank you for watching this presentation, and I hope you have a great day.

MBoulanger · ‎03-11-2022

Thanks Bryan Fricke for a very clear and inspiring presentation.

My post is maybe a bit off-topic, but you mentioned we could continue the discussion here and your presentation inspired this...

It is great to be able to see which models are “equivalent in terms of significance with confidence curve” and that can help choosing a model with many other considerations like if it would be best at interpretation, communication, operation, generalization.

But I would really like to decide which model is best based on the test data set.

With low sample size, I would also like to repeat this test many times for different data set splitting.

I would like to do something similar to Frederick Kistner in

https://community.jmp.com/t5/Discovery-Summit-Europe-2022/It-s-Otterly-Confusing-Short-Clawed-Hairy-...

He is choosing models using the model screening platform in JMP pro (e.g. 3 best models from training dataset with validation data used as criterion) and making final decision based on test set (completely new data not seen by the models) with formula depot… see from minute 21 to 30.

However, on small sample sizes, with many types of batches and grouping in the data, there are many different kind of splitting of the datasets that would make sense (which the subject matter expert would have to inform JMP about, e.g. giving as input the parameters that are batch/groups numbers) and many of these different dataset splitting might have poor stratification.

I wonder if it is possible to use the automatic JMP tool to make cross-validation and repetition against all these different data splits where the metrics of Rsquare, etc. are evaluated on the test datasets and not the validation dataset.

Do you have analytical suggestions for that type of problem?

Or should I drop the Test dataset if I have too little data? I hate doing that.

I am tempted to say that I can't conclude before I have made a DOE and maybe 2 or 3, but I would still like to get the most of my historic data first.

Thanks

Martin

Bryan_Fricke · ‎03-21-2022

Hi Martin,

Thank you for your question. For methods other than Neural and Decision Tree, you can use the Model Screening platform to determine the best method through repeated or nested cross validation. Afterwards, I'd recommend using the best method to fit a model using all the data.

The above approach won't work for the neural or decision tree methods as the validation set is used in model development. For those methods, I recommend splitting off some portion of the data set, say 20%, for testing generalizability.

As an alternative, you could use an ensemble of models. Peter Hersh provides an introduction to ensemble modeling in JMP here.

The particulars of your situation may be best addressed in a conversation at a Meet the Experts session on either 22 March or 24 March at Discovery Summit Europe online.

Best,

Bryan

MBoulanger · ‎06-09-2022

Thank you, Bryan, for your very good and condensed answer.

In what I saw so far, I think your suggestion of ensemble of models, SVEM, would be the best solution to the problem I was discussing where the data sample size is very small.

Best, Martin