Hello. My name is Bryan Fricke.
I'm a product manager at JMP focused on the JMP user experience.
Previously, I was a software developer working on exporting reports
to standalone HTML files, JMP Live, and JMP Public.
In this presentation,
I'm going to talk about using Confidence Curves
as an alternative to null hypothesis significance testing
in the context of predictive model screening.
Additional material on this subject can be found on the JMP Community website
in the paper associated with this presentation.
Dr. Russ Wolfinger is a Distinguished Research Fellow at JMP
and a co-author, and I would like to thank him for his contributions.
The Model Screening platform, introduced in JMP Pro 16,
allows you to evaluate the performance
of multiple predictive models using cross validation.
To show you how the Model Screening platform works,
I'm going to use the Diabetes Data table,
which is available in the JMP sample data library.
I'll choose Model Screening from the Analyze > Predictive Modeling menu.
JMP responds by displaying the Model Screening dialog.
The first three columns in the data table
represent disease progression in continuous, binary, and ordinal forms.
I'll use the continuous column named Y as the response variable.
I'll use the columns from age to glucose in the X factor role.
I'll type 1234 in the Set Random Seed input box for reproducibility.
I'll select the check box next to K-Fold cross validation and leave K set to five.
I'll type 3 into the input box next to Repeated K-Fold.
In the method list, I'll deselect Neural.
Now I'll click OK.
JMP responds by training and validating models for each of the selected methods
using their default parameter settings and cross validation.
After completing the training and validating process,
JMP displays the results in a new window.
For each modeling method, the Model Screening platform provides performance measures
in the form of point estimates for the coefficient of determination,
also known as R squared, the root average squared error,
and the standard deviation for the root average squared error.
Now I'll click Select Dominant.
JMP responds by highlighting the method
that performs best across the performance measures.
What's missing here is a graphic to show the size of the differences
between the dominant method and the other methods,
along with the visualization of the uncertainty
associated with the differences.
But why not just show P-values indicating whether the differences are significant?
Shouldn't a decision about whether one model is superior to another
be based on significance?
First, since the P-value provides a probability
based on a standardized difference,
a P-value by itself loses information about the raw difference.
A significant difference doesn't imply a meaningful difference.
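To make that point concrete, here is the standard textbook relationship (my notation, not from the presentation) between an observed mean difference, its standard error, and the P-value:

\[ t = \frac{\bar{d}}{\mathrm{SE}(\bar{d})}, \qquad p = 2\,\Pr\!\left(T_{\nu} \ge |t|\right) \]

A tiny difference with a tiny standard error can therefore yield the same P-value as a large difference with a large standard error, so the P-value alone cannot tell you which of the two you are looking at.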
Is that really a problem?
I mean, isn't it pointless
to be concerned with the size of the difference between two models
before using significance testing
to determine whether the difference is real?
The problem with that line of thinking is that it's power, or one minus beta,
that determines our ability to correctly reject a false null hypothesis.
Authors such as Jacob Cohen and Frank Schmidt
have suggested that typical studies have the power to detect differences
in the range of .4 to .6.
So let's suppose the power to detect a true difference is .5 at an alpha level of .05.
That suggests we would detect the true difference, on average, 50% of the time.
So in that case, significance testing would identify real differences
no better than flipping an unbiased coin.
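To illustrate that coin-flip point, here is a small Python simulation sketch of my own; the sample size and effect size are hypothetical, chosen so that the power of a two-sample t-test is roughly .5 at alpha = .05:

import numpy as np
from scipy import stats

# Effect size and per-group sample size chosen so power is roughly 0.5
# at alpha = 0.05 for a two-sample t-test (hypothetical illustration only).
rng = np.random.default_rng(1234)
n, true_diff, alpha = 50, 0.39, 0.05

trials = 10_000
rejections = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)          # group with mean 0
    b = rng.normal(true_diff, 1.0, n)    # group with a real, nonzero difference
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

# Prints roughly 0.5: the real difference is declared significant only about
# half the time, no better than calling a fair coin flip.
print(rejections / trials)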
If all other things are equal, type 1 and type 2 errors are equivalent.
But significance tests that use an alpha value of .05
often implicitly assume type 2 errors are preferable to type 1 errors,
particularly if the power is as low as .5.
A common suggestion to address these and other issues with significance testing
is to show the point estimate along with the confidence intervals.
One objection to doing so is that
a point estimate along with a 95% confidence interval
is effectively the same thing as significance testing.
Even if we assume that is true, a point estimate and confidence interval
still puts the magnitude of the difference
and the range of the uncertainty front and center,
whereas a lone P-value conceals them both.
So various authors, including Cohen and Schmidt,
have recommended replacing significance testing
with point estimates and confidence intervals.
Even so, the recommendation to use confidence intervals raises the question:
which ones do we show?
Showing only the 95% confidence interval
would likely encourage you to interpret it as another form of significance testing.
The solution provided by Confidence Curves
is to literally show all confidence intervals
up to an arbitrarily high confidence level.
So how do you show Confidence Curves in JMP?
To conveniently create Confidence Curves in JMP,
install the Confidence Curves add-in by visiting the JMP Community Homepage.
Type Confidence Curves into the search input field.
Click the Confidence Curves result.
Now click the download icon next to ConfidenceCurves.jmpaddin.
Now click the downloaded file.
JMP responds by asking if I want to install the add-in.
You would click Install.
However, I'll click Cancel, as I've already installed the add-in.
So how do you use the add-in?
First, to generate Confidence Curves for this report, select Save Results Table
from the top red triangle menu located in the Model Screening report window.
JMP responds by creating a new table containing, among others, the following columns:
Trial, which contains the identifiers for the three sets of cross validation results;
Fold, which contains the identifiers for the five distinct sets of subsamples used for validation in each trial;
Method, which contains the method used to create each model;
and N, which contains the number of data points used in the validation folds.
Note that the Trial column will be missing if the number of repeats is exactly one,
in which case the Trial column is neither created nor needed.
Save for that exception, these columns are essential
for the Confidence Curves add-in to function properly.
In addition to these columns, you need one column
that provides the metric to compare between methods.
I'll be using R squared as the metric of interest in this presentation.
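For readers who want to see the shape of such a table outside of JMP, here is a rough Python sketch, using scikit-learn and a few stand-in methods rather than JMP's own (a similar diabetes data set ships with scikit-learn), that produces fold-level R squared values in a Trial/Fold/Method/N layout like the one the add-in expects:

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Stand-in modeling methods; JMP's Model Screening platform uses its own.
methods = {"Lasso": Lasso(),
           "Decision Tree": DecisionTreeRegressor(random_state=0),
           "Linear Regression": LinearRegression()}

# 3 repeats of 5-fold cross validation, mirroring the settings in the demo.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1234)

rows = []
for i, (train_idx, test_idx) in enumerate(rkf.split(X)):
    trial, fold = divmod(i, 5)           # repeat number and fold number
    for name, model in methods.items():
        model.fit(X[train_idx], y[train_idx])
        r2 = r2_score(y[test_idx], model.predict(X[test_idx]))
        rows.append({"Trial": trial + 1, "Fold": fold + 1, "Method": name,
                     "N": len(test_idx), "RSquare": r2})

results = pd.DataFrame(rows)
print(results.head())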
Once you have the model screening results table,
click Add-Ins on JMP's main menu bar and then select Confidence Curves.
The logic that follows would be better placed in a wizard,
and I hope to add that functionality in a future release of this add-in.
As it is, the first dialog that appears
requests that you select the name of the table
that was generated when you chose Save Results Table
from the Model Screening report's red triangle menu.
The name of the table in this case is Model Screen Statistics Validation Set.
Next, a dialog is displayed
that requests the name of the method that will serve as the baseline
from which all the other performance metrics are measured.
I suggest starting with the method that was selected when you clicked
the Select Dominant option
in the Model Screening Report window, which in this case is Fit Stepwise.
Finally, a dialog is displayed
that requests you to select the metric to be compared between the various methods.
As mentioned earlier, I'll use R squared as the metric for comparison.
JMP responds by creating a Confidence Curve table that contains P-values
and corresponding confidence levels for the mean difference
between the chosen baseline method and each of the other methods.
More specifically, the generated table has columns for the following:
Model, in which each row contains the name of the modeling method
whose performance is evaluated relative to the baseline method;
P Value, in which each row contains the probability
associated with a performance difference
at least as extreme as the value shown in the Difference in R Square column;
Confidence Interval, in which each row contains the confidence level
we have that the true mean is contained in the associated interval;
and finally, Difference in R Square,
in which each row is the maximum or minimum
of the expected difference in R squared
associated with the confidence level shown in the Confidence Interval column.
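As a rough sketch of how those four columns relate to one another (my own naming, and a plain t-interval for brevity; the add-in itself uses the corrected standard error described later in this presentation), the rows for a single model could be generated like this:

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical fold-level differences in R squared versus the baseline method.
diffs = np.random.default_rng(1).normal(-0.02, 0.05, size=15)
n = diffs.size
mean = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)      # plain (uncorrected) standard error

levels = np.linspace(0.0, 0.9999, 201)   # confidence levels from 0% to 99.99%
half = stats.t.ppf(0.5 + levels / 2.0, df=n - 1) * se

def rows(endpoints):
    return pd.DataFrame({"Model": "Support Vector Machines",
                         "P Value": 1.0 - levels,      # p-value paired with each level
                         "Confidence Interval": levels,
                         "Difference in R Square": endpoints})

# One set of rows for the lower interval endpoints, one for the upper.
table = pd.concat([rows(mean - half), rows(mean + half)], ignore_index=True)
print(table.head())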
From this table, Confidence Curves are created
and shown in the Graph Builder graph.
So what are Confidence Curves?
To clarify the key attributes of a Confidence Curve,
I'll hide all but the Support Vector Machines Confidence Curve
using the local data filter by clicking on Support Vector Machines.
By default, a Confidence Curve only shows
the lines that connect the extremes of each confidence interval.
To see the points, select Show Control Panel from the red triangle menu
located next to the text that reads Graph Builder in the Title bar.
Now I'll Shift-click the Points icon.
JMP responds by displaying the endpoints of the confidence intervals
that make up the Confidence Curve.
Now I will zoom in and examine a point.
If you hover the mouse pointer over any of these points,
a hover label shows the P-value, confidence interval,
difference in the metric,
and the method used to generate the model
being compared to the reference model.
Now I'll turn off the points by Shift-clicking the Points icon
and clicking the Done button.
Even though the individual points are no longer shown,
you can still view the associated hover label
by placing the mouse pointer over the Confidence Curve.
The point estimate for the mean difference
in performance between the Support Vector Machines and Fit Stepwise models
is shown at the 0% confidence level;
it is the mean value of the differences computed using cross validation.
A Confidence Curve plots the extent
of each confidence interval from the generated table between zero
and the 99.99% confidence level.
The confidence level associated with each confidence interval is shown
along the left Y axis,
and the P-values associated with the confidence intervals are shown along the right Y axis.
The Y axis uses a log scale
so that more resolution is shown at higher confidence levels.
By default, two reference lines are plotted alongside a Confidence Curve.
The vertical line represents
the traditional null hypothesis of no difference in effect.
Note that you can change the vertical line position,
and thereby the implicit null hypothesis,
in the X axis settings.
The horizontal line passes through the conventional 95% confidence interval.
As with the vertical reference line,
you can change the horizontal line position
and thereby the implicit level of significance
by changing the Y axis settings.
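Here is a minimal matplotlib sketch of that layout (my own rendering with hypothetical differences, not the add-in's Graph Builder output): the two branches meet at the point estimate at the 0% confidence level, fan out as the confidence level rises, and the two default reference lines are drawn in:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical fold-level differences in R squared versus the baseline method.
diffs = np.random.default_rng(1).normal(-0.02, 0.05, size=15)
n, mean = diffs.size, diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)

p = np.logspace(-4, 0, 400)                     # p-values from 0.0001 up to 1
half = stats.t.ppf(1.0 - p / 2.0, df=n - 1) * se

plt.plot(mean - half, p, color="tab:blue")      # lower interval endpoints
plt.plot(mean + half, p, color="tab:blue")      # upper interval endpoints
plt.axvline(0.0, linestyle="--", color="gray")  # default null hypothesis: no difference
plt.axhline(0.05, linestyle="--", color="gray") # p = 0.05, i.e. the 95% confidence level
plt.yscale("log")                               # more resolution at high confidence (small p)
plt.xlabel("Difference in R Square")
plt.ylabel("P-value (log scale)")
plt.show()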
If a Confidence Curve crosses the vertical line above the horizontal line,
you cannot reject the null hypothesis using significance testing.
For example, we cannot reject the null hypothesis for support vector machines.
On the other hand, if a Confidence Curve
crosses the vertical line below the horizontal line,
you can reject the null hypothesis using significance testing.
For example, we can reject the null hypothesis for boosted tree.
How are Confidence Curves computed?
The current implementation of confidence curves assumes
the differences are computed
using R times repeated K-fold cross validation.
The extent of each confidence interval is computed
using what is known as a variance corrected resampled T-test.
Authors Claude Nadeau and Yoshua Bengio
note that a corrected resampled T-test
is typically used in cases where training sets
are five or ten times larger than validation sets.
For more details, please see the paper associated with this presentation.
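For reference, a common statement of that interval (following Nadeau and Bengio, shown here in generic textbook form, which may differ in detail from the add-in's exact implementation) for r repeats of k-fold cross validation is

\[ \bar{d} \;\pm\; t_{1-\alpha/2,\; rk-1}\, \sqrt{\left(\frac{1}{rk} + \frac{n_{\text{test}}}{n_{\text{train}}}\right)\hat{\sigma}^{2}} \]

where \(\bar{d}\) and \(\hat{\sigma}^{2}\) are the mean and sample variance of the rk fold-level differences, and \(n_{\text{test}}/n_{\text{train}}\) is the ratio of validation to training set sizes (1/4 for 5-fold cross validation). Sweeping the confidence level \(1-\alpha\) from 0 up to 99.99% traces out the Confidence Curve.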
So how are Confidence Curves interpreted?
First, the Confidence Curve graphically depicts
the mean difference in the metric of interest between a given method
and a reference method at the 0% confidence level.
So we can evaluate whether the mean difference between the methods
is meaningful.
If the mean difference isn't meaningful, there's little point in further analysis
of a given method versus the reference method with respect to the chosen metric.
What constitutes a meaningful difference depends on the metric of interest
as well as the intended scientific or engineering application.
For example,
you can see the model developed with a decision tree method
is on average about 14% worse than Fit Stepwise,
which arguably is a meaningful difference.
If the difference is meaningful,
we can evaluate how precisely the difference has been measured
by evaluating how much the Confidence Curve width changes
across the confidence levels.
For any confidence interval not crossing the default vertical axis,
we have at least that level of confidence that the mean difference is nonzero.
For example,
the decision tree confidence curve doesn't cross the Y axis
until about the 99.98% confidence level.
We are nearly 99.98% confident the mean difference isn't equal to zero.
In fact, with this data,
it turns out that we can be about 81% confident
that Fit Stepwise is at least as good,
if not better, than every method other than generalized regression Lasso.
Now let's consider the relationship between Confidence Curves.
If two or more Confidence Curves significantly overlap
and the mean difference of each is not meaningfully different from the other,
the data suggest each method performs about the same as the other
with respect to the reference model.
So for example,
we see that on average the Support Vector Machines model
differs from Bootstrap Forest by less than .5%,
which is arguably not a meaningful difference.
The confidence intervals do not overlap until about the 4% confidence level,
which suggests these values would be expected
if both methods really do have about the same difference in performance
with respect to the reference.
If the average difference in performance
is about the same for two Confidence Curves,
but the confidence intervals don't overlap much,
the data suggest
the models perform about the same as each other with respect to the reference model;
however, we are confident the difference is not meaningful.
This particular case is rarer than the others
and I don't have an example to show with this data set.
On the other hand, if the average difference in performance
between a pair of Confidence Curves is meaningfully different
and the confidence curves have little overlap,
the data suggests the models perform
differently from one another with respect to the reference.
For example, the generalized regression Lasso model predicts about 13.8%
more of the variation in the response than does the decision tree model.
Moreover, the Confidence Curves don't overlap
until about the 99.9% confidence level,
which suggests these results are quite unusual
if the methods actually perform about the same with respect to the reference.
Finally, if the average difference in performance between a pair
of Confidence Curves is meaningfully different from one another
and the curves have considerable overlap,
the data suggests that
while the methods perform differently from one another with respect to the reference,
it wouldn't be surprising if the difference is spurious.
For example, we can see that on average
Support Vector Machines predicted about 1.4% more of the variance in the response
than did K Nearest Neighbors.
However, the confidence intervals begin to overlap at about the 17% confidence level,
which suggests it wouldn't be surprising
if the difference in performance between each method
and the reference is actually smaller than suggested by the point estimates.
Simultaneously,
it wouldn't be surprising if the actual difference is larger than measured,
or if the direction of the difference is actually reversed.
In other words, the difference in performance is uncertain.
Note that it isn't possible to assess the variability in performance
between two models relative to one another
when the differences are relative to a third model.
To compare the variability in performance
between two methods relative to one another,
one of the two methods must be the reference method
from which the differences are measured.
But what about multiple comparisons?
Don't we need to adjust the P-values
to control the family-wise type 1 error rate?
In his paper about Confidence Curves,
Daniel Berrar suggests that adjustments are needed in confirmatory studies,
where a goal is prespecified, but not in exploratory studies.
This idea suggests using unadjusted P-values for multiple Confidence Curves
in an exploratory fashion,
and using only a single Confidence Curve generated from different data
to confirm a finding of a significant difference
between two methods when using significance testing.
That said, please keep in mind the dangers of cherry picking and p-hacking
when conducting exploratory studies.
In summary, the Model Screening platform, introduced in JMP Pro 16,
provides a means to simultaneously compare the performance
of predictive models created using different methodologies.
JMP has a long-standing goal to provide a graph with every statistic,
and Confidence Curves help to fill that gap for the Model Screening platform.
You might naturally expect to use significance testing to differentiate
between the performance of various methods being compared.
However, P-values have come under increased scrutiny in recent years
for obscuring the size of performance differences.
In addition,
P-values are often misinterpreted as the probability the null hypothesis is true.
Instead, a P-value is the probability of observing a difference
as or more extreme than the one observed, assuming the null hypothesis is true.
The probability of correctly rejecting the null hypothesis when it is false
is determined by power, or one minus beta.
I have argued that it is not uncommon to only have a 50% chance
of correctly rejecting the null hypothesis with an alpha value of .05 .
As an alternative, a confidence interval could be shown instead of a lone P-value.
However, the question would be left open as to which confidence level to show.
Confidence Curves address these concerns by showing all confidence intervals
up to an arbitrarily high- level of confidence.
The mean difference in performance
is clearly visible at the 0% confidence level, and it acts as a point estimate.
All other things being equal, type 1 and type 2 errors are equivalent.
Confidence Curves don't embed a bias towards trading type 1 errors for type 2.
Even so, by default,
a vertical line is shown in the confidence curve graph
for the standard null hypothesis of no difference.
In addition,
a horizontal line is shown that delineates the 95% confidence interval,
which readily affords a typical significance testing analysis if desired.
The defaults for these lines are easily modified
if a different null hypothesis or confidence level is desired.
Even so, given the rather broad and sometimes emphatic suggestion
to replace significance testing with point estimates and confidence intervals,
it may be best to view a Confidence Curve as a point estimate
along with a nearly comprehensive view of its associated uncertainty.
If you have feedback about the Confidence Curves add-in,
please leave a comment on the JMP community site.
And don't forget to vote for this presentation
if you found it interesting or useful.
Thank you for watching this presentation and I hope you have a great day.