Hello, my name is Bryan Fricke.
I'm a newly minted Product Manager at JMP, focusing on the JMP user experience.
Previously, I was a software developer working on exporting reports
to standalone HTML files, JMP Live, and JMP Public.
In this presentation,
I'm going to talk about using confidence curves
as an alternative to Null Hypothesis Significance Testing
in the context of predictive model screening.
Additional material on this subject can be found on the JMP Community website
and the paper associated with this presentation.
Dr. Russ Wolfinger, who is a distinguished research fellow at JMP, is a co-author,
and I would like to thank him for his contributions.
The Model Screening platform, introduced in JMP Pro 16,
allows you to evaluate the performance
of multiple predictive models using cross-validation.
To show you how the Model Screening platform works,
I'm going to use the Diabetes Data table,
which is available in the JMP Sample Data Library.
I'll choose Model Screening from the Analyze > Predictive Modeling menu.
JMP responds by displaying the Model Screening dialog.
The first three columns in the data table
represent disease progression in continuous, binary, and ordinal forms.
I'll use the continuous column named Y as the response variable.
I'll use all the columns from Age to Glucose in the X, Factor role.
I'll type 1234 in the Set Random Seed input box for reproducibility.
I'll select the check box next to K Fold Cross validation
and leave K set to five.
I'll type three into the input box next to Repeated K Fold.
In the Method list, I'll unselect Neural,
and now I'll select OK.
JMP responds by training and validating models for each of the selected Methods
using their default parameter settings and cross validation.
After completing the training and validating process,
JMP displays the results in a new window.
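To make the training and validating process more concrete, here's a rough conceptual sketch in Python using scikit-learn rather than the JMP platform itself; the synthetic data and the particular estimators are illustrative stand-ins for the diabetes table and JMP's methods.

```python
# A conceptual analogue of Model Screening: fit several modeling methods
# with repeated k-fold cross-validation and collect a metric for each fold.
# The data and the estimator choices below are hypothetical stand-ins.
from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the diabetes table
X, y = make_regression(n_samples=442, n_features=10, noise=25.0, random_state=1234)

methods = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=1234),
    "Random Forest": RandomForestRegressor(random_state=1234),
    "SVM": SVR(),
    "K Nearest Neighbors": KNeighborsRegressor(),
}

# Five folds repeated three times, with a fixed seed for reproducibility
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1234)

for name, model in methods.items():
    scores = cross_val_score(model, X, y, scoring="r2", cv=cv)
    print(f"{name:20s}  mean R^2 = {scores.mean():.3f}  (sd = {scores.std():.3f})")
```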
For each modeling method, the Model Screening platform provides
performance measures in the form of point estimates
for the coefficient of determination,
which is also known as R squared, the root average squared error,
and the standard deviation for the root average squared error.
Now I'll click Select Dominant.
JMP responds by highlighting the method that performs best
across the performance measures.
What's missing here is a graphic
to show you the size of the difference between the Dominant Method
and the other methods, along with the visualization
of the uncertainty associated with the differences.
But why not just show P values indicating whether the differences are significant?
Shouldn't a decision about whether one model is superior to another
be based on significance?
First,
since a P value provides a probability based on a standardized difference,
a P value by itself loses information about the raw difference,
so a significant difference doesn't imply a meaningful difference.
But is that really a problem?
I mean, isn't it pointless to be concerned
with the size of a difference between two models before significance testing is used
to determine whether the difference is real?
The problem with that line of thinking is that it's power, or one minus beta,
that determines our ability to correctly reject a null hypothesis.
Authors such as Jacob Cohen and Frank Schmidt
have suggested that typical studies have a power to detect differences
in the range of 0.4 to 0.6.
Let's suppose we have a difference where the power to detect a true difference
is 0.5 at an Alpha value of 0.05.
That suggests we would detect the true difference
on average 50 percent of the time.
In that case, significance testing would identify real differences
no better than flipping an unbiased coin.
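To make that concrete, here's a minimal sketch in Python of a power calculation under a two-sided z-test approximation; the effect size and sample size are hypothetical values chosen so that the power at an Alpha of 0.05 comes out to roughly 0.5.

```python
# A minimal power calculation using a two-sided z-test approximation.
# The standardized effect size and per-group sample size are hypothetical.
from scipy.stats import norm

alpha = 0.05
effect_size = 0.5      # hypothetical standardized difference (Cohen's d)
n_per_group = 32       # hypothetical sample size per group

# Noncentrality parameter for a two-sample comparison
delta = effect_size * (n_per_group / 2) ** 0.5
z_crit = norm.ppf(1 - alpha / 2)

# Probability of rejecting the null when the true difference equals effect_size
power = (1 - norm.cdf(z_crit - delta)) + norm.cdf(-z_crit - delta)
print(f"Approximate power: {power:.2f}")   # roughly 0.5 for these inputs
```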
If all other things are equal, Type One and Type Two errors are equally undesirable.
But significance tests that use an Alpha value of 0.05
often implicitly assume Type Two errors are preferable to Type One errors,
particularly if the power is as low as 0.5.
A common suggestion to address these and other issues with significance testing
is to show the point estimate along with confidence intervals.
One objection to doing so is that a point estimate
along with a 95 percent confidence interval
is effectively the same thing as significance testing.
Even if we assume that is true, a point estimate and confidence interval
still puts the magnitude of the difference in the range of the uncertainty
front and center,
whereas a lone P value conceals them both.
Various authors, including Cohen and Schmidt,
have recommended replacing significance testing
with point estimates and confidence intervals.
Even so, the recommendation to use confidence intervals
raises the question, "Which ones do we show?"
Showing only the 95 percent confidence interval
would likely encourage you to interpret it as another form of significance testing.
The solution provided by confidence curves
is to literally show all confidence intervals
up to an arbitrarily high confidence level.
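To illustrate the idea, here's a minimal sketch in Python that sweeps the confidence level from 0 up to 99.99 percent for a single mean difference and standard error; the numbers are hypothetical, not results from the diabetes example.

```python
# A minimal sketch of the idea behind a confidence curve: from one mean
# difference and its standard error, compute the endpoints of confidence
# intervals at every level from 0 up to an arbitrarily high 99.99 percent.
# The mean difference, standard error, and degrees of freedom are hypothetical.
import numpy as np
from scipy.stats import t

mean_diff = 0.02     # hypothetical mean difference in R^2
std_err = 0.012      # hypothetical standard error of the difference
df = 14              # hypothetical degrees of freedom

levels = np.linspace(0.0, 0.9999, 500)          # confidence levels 0 .. 99.99%
half_widths = t.ppf(0.5 + levels / 2, df) * std_err
lower, upper = mean_diff - half_widths, mean_diff + half_widths

# Plotting lower and upper against the corresponding p-values (1 - levels)
# on a log scale reproduces the funnel shape of a confidence curve.
for level, lo, hi in zip(levels[::100], lower[::100], upper[::100]):
    print(f"{level:7.2%} CI: [{lo:+.4f}, {hi:+.4f}]")
```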
So how do you show confidence curves in JMP?
To conveniently create confidence curves in JMP,
install the Confidence Curves add-in by visiting the JMP Community homepage.
Type "confidence curves" into the search input field.
Click on the first entry that appears.
Now click the download icon next to the Confidence Curves .jmpaddin file,
and then click on the downloaded file.
JMP responds by asking if I want to install the add-in.
You would click "Install."
However, I'll click "Cancel," as I already have the add-in installed.
How do you use the add-in?
First, to generate confidence curves for this report,
select "Save Results Table"
from the top red triangle menu located in the Model Screening report window.
JMP responds by creating a new table containing, among others,
the following columns:
Trial, which contains the identifiers for the three sets of cross-validation results;
Fold, which contains the identifiers for the five distinct sets of subsamples
used for validation in each trial;
Method, which contains the methods used to create models from the training data;
and N, which contains the number of data points used in the validation folds.
Note that the Trial column is neither created nor needed
if the number of repeats is exactly one.
Save for that exception,
these columns are essential for the Confidence Curves add-in
to function properly.
In addition to these columns,
you need one column that provides the metric to compare methods.
I'll be using R squared as the metric of interest in this presentation.
Once you have the Model Screening results table,
click "Add-Ins" on JMP's main menu bar and then select "Confidence Curves."
The logic that follows would be better placed in a wizard,
and I hope to add that functionality in a future release of the add-in.
As it is, the first dialog that appears
asks you to select the name of the table that was generated
when you chose "Save Results Table"
from the Model Screening report's red triangle menu.
The name of the table in this case is Model Screening Statistics Validation Set.
Next, a dialog is displayed that requests the name of the method
that will serve as the baseline
from which all the other performance metrics are measured.
I suggest starting with the method
that was highlighted when you clicked Select Dominant
in the Model Screening report window,
which in this case is Fit Stepwise.
Finally, a dialog is displayed
that asks you to select the metric to be compared across the various methods.
As mentioned earlier, I'll use R squared as the metric for comparison.
JMP responds by creating a confidence curve table that contains P values
and corresponding confidence levels
for the mean metric difference between the chosen baseline method
and each of the other methods.
More specifically, the generated table has columns for the following:
Model, in which each row contains the name of the modeling method
whose performance is evaluated relative to the baseline method;
P value,
in which each row contains the probability associated with a performance difference
at least as extreme as the value shown in the Difference in R Square column;
Confidence Interval, in which each row contains the confidence level we have
that the true mean is contained in the associated interval.
And finally, Difference in R Square, in which each row is the upper or lower endpoint
of the confidence interval for the mean difference in R squared
at the confidence level shown in the Confidence Interval column.
From this table, confidence curves are created
and shown in the Graph Builder graph.
What are confidence curves?
To clarify the key attributes of a confidence curve,
I'll hide all but the Support Vector Machines confidence curve
using the local data filter by clicking on Support Vector Machines.
By default, a confidence curve only shows the lines
that connect the extremes of each confidence interval.
To see the points, select Show Control Panel from the red triangle menu
located next to the text that reads Graph Builder in the title bar.
Now I'll Shift click the Points icon.
JMP responds by displaying the end points of the confidence intervals
that make up the confidence curve.
Now I will zoom in and examine a point.
If you hover the mouse pointer over any of these points,
a hover label shows the P value, confidence interval,
difference in the size of the metric,
and the method used to generate the model being compared to the reference.
Now I will turn off the points by Shift clicking the Points icon
and clicking the Done button.
Even though the individual points are no longer shown,
you can still view the associated hover labels
by placing the mouse pointer over the confidence curve.
The point estimate for the mean difference in performance
between Support Vector Machines and Fit Stepwise
is shown at the 0 percent confidence level,
which is the mean value of the differences computed using cross-validation.
The confidence curve plots the extent of each confidence interval
from the generated table between the 0 and 99.99 percent confidence levels,
the latter being an arbitrarily high value.
Along the left Y axis,
P values associated with the confidence intervals are shown.
Along the right Y axis,
the confidence level associated with each confidence interval is shown.
The Y axis uses a log scale,
so that more resolution is shown at higher confidence levels.
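In symbols, the two Y axes encode the same information, which is why a single curve can be read against either axis:

```latex
% The confidence interval at level (1 - alpha) corresponds to a two-sided
% p-value of alpha, so the two axes are mirror images of one another.
\text{left axis: } p = \alpha
\qquad\Longleftrightarrow\qquad
\text{right axis: confidence level} = (1 - \alpha) \times 100\%
```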
By default, two reference lines are plotted alongside a confidence curve.
The vertical line represents the traditional null hypothesis
of no difference in effect.
Note you can change the vertical line position
and thereby the implicit null hypothesis
in the X axis settings.
The horizontal line
passes through the conventional 95 percent confidence interval.
As with the vertical reference line,
you can change the horizontal line position
and thereby the implicit level of significance,
by changing the Y axis settings.
If a confidence curve crosses the vertical line above the horizontal line,
you cannot reject the null hypothesis using significance testing.
For example, we cannot reject the null hypothesis for Support Vector Machines.
On the other hand,
if a confidence curve crosses the vertical line below the horizontal line,
you can reject the null hypothesis.
For example, we can reject the null hypothesis for Boosted Tree.
How are confidence curves computed?
The current implementation of confidence curves
assumes the differences are computed using R-times repeated
K-fold cross-validation.
The extent of each confidence interval is computed
using what is known as a variance-corrected resampled t-test.
Claude Nadeau and Yoshua Bengio noted that a corrected resampled t-test
is typically used in cases where training sets
are five or ten times larger than validation sets.
For more details, please see the paper associated with this presentation.
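For reference, here is a sketch of the corrected resampled t-interval in the form given by Nadeau and Bengio; the add-in's exact implementation may differ. Here d-bar is the mean of the k times r per-fold differences in the metric, sigma-hat squared is their sample variance, and n1 and n2 are the numbers of training and validation observations in a fold.

```latex
% Corrected resampled t statistic and the endpoints of the (1 - alpha)
% confidence interval for the mean per-fold difference, with k*r - 1
% degrees of freedom (k folds, r repeats).
t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k r} + \frac{n_2}{n_1}\right)\hat{\sigma}^2}},
\qquad
\bar{d} \;\pm\; t_{1-\alpha/2,\,k r - 1}
\sqrt{\left(\frac{1}{k r} + \frac{n_2}{n_1}\right)\hat{\sigma}^2}
```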
So how are confidence curves interpreted?
First, a confidence curve graphically depicts
the mean difference in the metric of interest between a given method
and a reference method at the 0 percent confidence level.
We can evaluate whether the mean difference between methods is meaningful.
If the mean difference isn't meaningful, there's little point in further analysis
of the given method versus the reference method with respect to the chosen metric.
What constitutes a meaningful difference depends on the metric of interest
as well as the intended scientific or engineering application.
For example, you can see the model developed with a decision tree method
is on average about 14 percent worse than Fit Stepwise,
which arguably is a meaningful difference.
If the difference is meaningful,
we can evaluate how precisely the difference has been measured
by evaluating the width of the associated confidence intervals.
For any confidence interval not crossing the default vertical axis,
we have at least that level of confidence that the mean difference is non-zero.
For example, the Decision Tree confidence curve doesn't cross the Y axis
until about the 99.98 percent confidence level,
so we are nearly 99.98 percent confident the mean difference isn't equal to zero.
In fact, with this data set,
it turns out that we can be about 81 percent confident
that Fit Stepwise is at least as good as, if not better than,
every method other than Generalized Regression Lasso.
Now let's consider the relationship between confidence curves.
If two or more confidence curves significantly overlap
and the mean difference of each is not meaningfully different from the other,
the data suggest each method performs
about the same as the other with respect to the reference model.
For example, we can see that on average,
the Support Vector Machines model performs
less than 0.5 percent better than Bootstrap Forest,
which is arguably not a meaningful difference.
Moreover, the confidence intervals
begin to overlap at about the four percent confidence level,
which suggests values like these would be expected
if both methods really do have about the same difference in performance
with respect to the reference.
If the average difference in performance
is about the same for two confidence curves
but the confidence intervals don't overlap much,
the data suggest the models perform about the same as each other
with respect to the reference model;
however, in that case we can be confident the difference, while real, is not meaningful.
This particular case is rarer than the others, and
I don't have an example to show with this data set.
On the other hand,
if the average difference in performance between a pair of confidence curves
is meaningfully different and the confidence curves have little overlap,
the data suggests the models perform differently from one another
with respect to the reference.
For example, the Generalized Regression Lasso model
predicts about 13.8 percent more of the variation in the response
than does the decision tree model.
Moreover, the confidence curves don't overlap
until about the 99.9 percent confidence level,
which suggests these results are quite unusual
if the methods actually perform about the same with respect to the reference.
Finally, if the average difference in performance
between a pair of confidence curves
is meaningfully different from one another and the curves have considerable overlap,
the data suggests that while the methods perform differently from one another
with respect to the reference,
it wouldn't be surprising if the differences are spurious.
For example, we can see that on average,
Support Vector Machines predicted about 1.4 percent more of the variance
in the response than did K Nearest Neighbors.
However, the confidence intervals begin to overlap
at about the 17 percent confidence level,
which suggests it wouldn't be surprising if the difference in performance
between each method and the reference
is actually smaller than suggested by the point estimates.
Simultaneously,
it wouldn't be surprising if the actual difference is larger than measured,
or if the direction of the difference is actually reversed.
In other words, the difference in performance is uncertain.
Note that it isn't possible to assess the variability in performance
between two models relative to one another
when the differences are relative to a third model.
To compare the variability in performance between two methods
relative to one another,
one of the two methods must be the reference method
from which the differences are measured.
But what about multiple comparisons?
Don't we need to adjust the P values
to control the family-wise Type One error rate?
In his paper about confidence curves, Daniel Berrar suggests
that adjustments are needed in confirmatory studies
where a goal is prespecified, but not in exploratory studies.
This suggests using unadjusted P values
when exploring multiple confidence curves,
and, if you are using significance testing,
confirming a finding of a significant difference between two methods
with a single confidence curve generated from different data.
That said, please keep in mind the dangers of cherry-picking
and P-hacking when conducting exploratory studies.
In summary, the Model Screening platform introduced in JMP Pro 16
provides a means to simultaneously compare the performance
of predictive models created using different methodologies.
JMP has a long-standing goal to provide a graph with every statistic,
and confidence curves help to fill that gap
for the Model Screening platform.
You might naturally expect to use significance testing to differentiate
between the performance of the various methods being compared.
However, P values have come under increased scrutiny in recent years
for obscuring the size of performance differences.
In addition,
P values are often misinterpreted as the probability the null hypothesis is false.
Instead, a P value is the probability of observing a difference at least as extreme
as the one observed, assuming the null hypothesis is true.
The probability of correctly rejecting the null hypothesis
when it is false is determined by power, or one minus beta.
I've argued that it is not uncommon to only have a 50 percent chance
of correctly rejecting the null hypothesis with an Alpha value of 0.05 .
As an alternative, a confidence interval could be shown instead of a lone P value.
However, the question would be left open as to which confidence level to show.
Confidence curves address these concerns
by showing all confidence intervals up to an arbitrarily high level of confidence.
The mean difference in performance
is clearly visible at the zero percent confidence level,
and that acts as a point estimate.
All other things being equal,
Type One and Type Two errors are equally undesirable,
so confidence curves don't embed a bias towards trading Type One errors
for Type Two.
Even so, by default, a vertical line is shown in the confidence curve graph
for the standard null hypothesis of no difference.
In addition, a horizontal line is shown that delineates
the 95 percent confidence interval,
which readily affords a typical significance testing analysis if desired.
The defaults for these lines are easily modified
if a different null hypothesis or confidence level is desired.
Even so, given the rather broad and sometimes emphatic suggestion
to replace significance testing with point estimates and confidence intervals,
it may be best to view a confidence curve as a point estimate
along with a nearly comprehensive view of its associated uncertainty.
If you have feedback about the Confidence Curves add-in,
please leave a comment on the JMP Community site,
and don't forget to vote for this presentation
if you found it interesting or useful.
Thank you for watching this presentation, and I hope you have a great day.