Well, thank you everyone for joining me.
This is a Discovery Summit 2022 presentation,
courtesy of my co-presenters, Charles Chen and Mason Chen.
My name is Patrick Giuliano.
The title of this talk is,
Box Plot Analysis: Blending Scientific and Artistic Enquiry
in Univariate Response Characterization.
Here's the abstract.
You can find this on the JMP User Community
in the Discovery 2022 community page,
US Discovery 2022 community page.
I'm putting it here for reference.
I will provide a link in the slides
to the community page where the project will live.
What's the motivation for this project?
The Box Plot is one of the most popular graphical tools
to visualize a univariate distribution of data.
This project studies how to use the Box Plot to analyze data effectively.
Most people who use the Box Plot don't use it necessarily
to determine the shape of the distribution of the data.
In fact, many people use it wrongly to draw mean or mean comparison decisions,
and they may assume normality based on symmetry,
when in fact, the normality assumption
would actually not be reasonable if they were to take a closer look
at the shape of the data on a histogram, for example.
The objective of this project is to demonstrate how to use JMP,
specifically JMP 16, to interpret information in a Box Plot
and to improve proficiency
in a global community of scientists and engineers
who are working under a DMAIC or APS,
or Lean Six Sigma type methodology,
which is very popular, obviously, today and over the last few decades.
The interesting thing about this project is we framed it
in the context of 17 quiz questions.
This is a question and answer slide deck.
And I'm not going to go into too much detail
about each and every question,
which I will show you here.
But what I'd like to do is show you a little bit about
how you can use JMP to explore the answer to these questions,
because I think that's really the most interesting and fun part.
The first thing I wanted to do is just quickly go over what a Box Plot is.
So what is the anatomy of a Box Plot?
Just as a refresher for some of you, or introduction for some of you,
the median is indicated by the midline,
and it's referred to as the second quartile, Q2
or the 50th percentile.
Then Q1 is referred to as the 25th percentile.
Q3 is the 75th percentile, as you can see here.
The interquartile range, or IQR,
is the difference between Q3 and Q1.
The other important elements are we have what's called
a whisker on the lower side and on the upper side.
Right at the end of that whisker, sometimes we refer to it as the fence
and you'll see a vertical line
and JMP draws a vertical line to indicate that edge.
What's important about this is that this location is actually
Q1 minus 1.5 times the IQR,
which is represented by the distance between this edge and this edge.
This point at the upper fence is
Q3 plus one and a half times the IQR.
So that defines the upper edge.
Then any points that are beyond these edges or fences
are considered potential outliers.
And they actually show up by themselves as points,
whereas the rest of the data in the middle of the histogram
is not shown, for emphasis on the points that are beyond these fences.
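The anatomy just described can be sketched in a few lines of Python. This is only an illustration, not JMP's implementation: the toy data and the function name `box_plot_summary` are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quantile method JMP uses.

```python
import statistics

def box_plot_summary(data, k=1.5):
    """Return the quartiles, IQR, fences, and points beyond the fences."""
    # statistics.quantiles with n=4 returns [Q1, Q2 (median), Q3].
    q1, q2, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr   # Q1 minus 1.5 times the IQR
    upper_fence = q3 + k * iqr   # Q3 plus 1.5 times the IQR
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical toy data with one large value.
sample = [1, 2, 3, 4, 5, 6, 7, 15]
summary = box_plot_summary(sample)
for name, value in summary.items():
    print(name, value)
```

Points listed under `outliers` are the ones a Box Plot would draw individually beyond the fences.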
I'm going to jump right in.
How did we explore and develop the answers to these questions,
and in some cases, even refine the questions themselves?
Well, we created a simulated data table in JMP 16,
where we constructed 100 rows of data,
and we constructed data, first from a normal distribution,
and then applied some transformation to that data.
We see that we have normally distributed data
drawn from a population with a mean of zero and a standard deviation of one.
Then we have uniformly distributed data.
Then we have data that's peaked,
i.e., has a positive Kurtosis.
Then we have data that's right skewed,
has two modes,
has some outliers, about 3% on average,
and then integers.
In all the cases, with the exception of the bimodal,
we just based the simulated formula,
on the original normal column.
The way that we put all the data on the same scale
is we used the column standardize function
so that we could compare all the data
relative to each other in the distribution platform.
This is just a preview of that.
I'll jump over to JMP and show you that.
But again, all of this data is centered at a mean of approximately zero
and has a standard deviation of approximately one.
We covered the first question.
Why is a Box Plot sometimes referred to as a five-point plot?
Well, there are five main points.
There's Q1, there's Q2 in the middle, there's Q3,
and then there are the whiskers, the upper and lower.
Next question.
What are the two ways that the Box Plot can determine
whether the distribution is skewed?
Well, we can look at the width of the box itself.
We can also look at the width of the whiskers.
In this right skewed example,
you can see that upper whisker is much longer than the lower one.
So that would imply that the data is right skewed.
In other words, that the tail in the data,
if you were to imagine that distribution, is pointing to the right.
Third, why does the Box Plot include the median and not the mean?
Well, a Box Plot uses the median to determine or gauge skewness.
So if the distribution is normal, then the mean is equal to the median.
And in fact, what you would see here is that this median line in the middle
would line up exactly with
the middle of this diamond.
In that case, you would effectively have a situation
where you're not really losing any information
because the distribution is symmetric.
The median in general, then might be considered better,
regardless of whether the distribution is normal or non-normal.
Fourth, why is the Box Plot the most powerful visualization tool,
or one of the most powerful tools to separate skewness and outlier problems?
We talked about this idea: because the Box Plot uses
this Q1 minus one and a half times IQR
and Q3 plus one and a half times IQR methodology,
it really allows us to separate potential outliers from the main data.
It also gives us a framework
by which to judge whether the upper whisker
is larger or smaller than the lower whisker.
So those two components of the plot
really help us see, if we're trained, skewness
and potential outlyingness.
This is a unique feature of the Box Plot.
The fifth question is a little more interesting.
What's the relationship between the interquartile range,
that distance between Q1 and Q3,
and the standard deviation, which we can calculate for any data set,
regardless of how it's distributed?
If the data is normal,
what about if the data is skewed or non-normal or peaked
or any other shape?
Well, we know, based on theory,
that the ratio of the IQR to the standard deviation
is about 1.35 for normal data.
What would that ratio look like if the data wasn't normal?
Well, we can explore that in JMP,
and I'm going to show you that really quickly.
Here's the data set.
I'm also going to post this on the community.
The first thing I'm going to do is go ahead and show you
how I get to a visual state
where we can see all the Box Plots together,
without the distributions.
This is interesting,
but I'm going to go ahead and start from the beginning.
I'm going to analyze distribution.
I'm going to show you how I got there.
I'm actually going to click everything,
and JMP is going to give me a histogram and a Box Plot together.
At the end of the presentation,
we're going to summarize why that's important.
But what I'm going to do here is I'm going to go ahead and turn off the histogram,
I can go ahead and customize the width here of the lines.
I can copy this customization over,
which is really nice.
I'm going to hold down the control key because I'm on a PC.
I'm going to right click
and then I'm going to hit Edit, Copy, Paste Customizations,
and that's going to bring them all over.
I'm actually going to hold down the control key again and resize this
so that I can resize them all together.
Now I'm going to minimize the quantile section,
because I'm going to get the information that I need
from the summary statistics section.
I actually have the IQR and a standard deviation shown here.
Although I could customize this, either here or in the properties,
which I can access under File, Preferences and under the distribution platform group.
What I'm going to do is I'm actually going to make
this information into a data table
and I'm going to right-click and select Make Combined Data Table to do that.
Now I only need the IQR and the standard deviation.
I'm really only interested in that.
So I'm actually going to select one of the standard deviations
and one of the IQRs.
I'm going to move my cursor over here
and select Matching on all of the rows
that have these values in them.
I'm going to go ahead and invert the selection,
delete the rows that I don't want,
and I'm left with this.
Now I'm just going to go ahead and restructure the data
so that I can calculate the ratio of the IQR to the standard deviation.
So I'm just going to use Tables, Split for that.
I'm going to go ahead and split by column 1,
put column 2 in here, put these in a group.
I'm going to click OK.
I have the data how I want.
This shows me which distribution
these statistics came from.
I'm going to go ahead and do a New Formula Column, Combine, Ratio.
There you go.
This looks a little bit hard to interpret for me.
I'm going to go ahead and change it so that I can only see two decimals.
I've got numbers which are very similar,
it should be, anyway, very similar to what I have in the slide here
detailing the ratio of the IQR to the standard deviation.
Of course, they're going to be different because there's sampling error.
This table is only one sampling experiment.
But this is how I can quickly
and interactively extract this information
and really understand, what does this ratio look like
if my data is not normal in a particular way?
We can see here that the values that tend to be lower
are the peaked distribution and the one with outliers.
The values that tend to be higher than the typical
or the expected theoretical 1.35 for the normal
are going to be the uniform, the right skewed and bimodal.
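To make the IQR-to-standard-deviation ratio concrete, here is a small Python sketch, offered as an illustration rather than a reproduction of the JMP steps above. The function names are hypothetical. The normal value comes from the standard normal quartiles, and the uniform value follows from IQR = (b - a) / 2 and SD = (b - a) / sqrt(12).

```python
import math
import statistics
from statistics import NormalDist

def normal_iqr_to_sd():
    """Theoretical IQR / SD for a normal distribution (SD = 1)."""
    z = NormalDist()  # standard normal
    return z.inv_cdf(0.75) - z.inv_cdf(0.25)

def uniform_iqr_to_sd():
    """Theoretical IQR / SD for a continuous uniform distribution."""
    # IQR = (b - a) / 2 and SD = (b - a) / sqrt(12), so the range cancels.
    return math.sqrt(12) / 2

def iqr_to_sd(data):
    """Sample IQR / sample SD, using Python's default quantile method."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (q3 - q1) / statistics.stdev(data)

print(round(normal_iqr_to_sd(), 3))    # about 1.349
print(round(uniform_iqr_to_sd(), 3))   # about 1.732
print(round(iqr_to_sd(list(range(1, 100))), 3))  # evenly spread values
```

A sample ratio well above about 1.35 points toward a flatter shape such as the uniform, while a ratio well below it points toward a peaked shape or one with outliers inflating the standard deviation.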
Next question, what's the ideal outlier percent
if the distribution is perfectly normal?
Well, it turns out that if we look in the textbooks or do a simulation,
on average, we should see about 0.7% of the points beyond the fences
in a normal distribution.
A control chart, which is also built under the assumption of normality,
gives us a similar reference point.
For example,
if we were to do an individuals and moving range chart, we would see around 0.27%
of the points on average being outside the three-sigma limits.
Although for practical purposes,
we could probably say,
if we saw about 3% or less of the distribution beyond the limits,
we would consider it approximately normal.
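The figure of roughly 0.7% beyond the fences can be checked directly from the normal distribution. The sketch below assumes a perfectly normal population and uses a hypothetical function name; it is a theoretical calculation, not a simulation of the 100-row table.

```python
from statistics import NormalDist

def normal_fraction_beyond_fences(k=1.5):
    """Expected fraction of a normal population outside the box plot fences."""
    z = NormalDist()                      # standard normal, mean 0 and SD 1
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    upper_fence = q3 + k * (q3 - q1)      # about 2.698 when k = 1.5
    # The distribution is symmetric, so double the upper-tail area.
    return 2 * (1 - z.cdf(upper_fence))

print(f"{normal_fraction_beyond_fences():.4%}")     # about 0.70%
print(f"{normal_fraction_beyond_fences(3.0):.6%}")  # far smaller with a wider fence
```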
Why is that question important?
Well, we can use the proportion of the points beyond the fences
in a Box Plot when the sample size is small
to determine whether or not we have some evidence of normality
on the basis of outliers.
Although if our sample size is too big,
then we're going to see lots and lots of points beyond those fences.
So it's really important that we consider a "reasonable sample size."
And that's part of the reason why we only considered 100 rows in our project.
Next question, what's the difference between
a quartile range and a quantile range Box Plot?
Well, in a practical context, anyway, we can talk about
the Explore Outliers utility in JMP 16,
which allows us to adjust the Q,
which is the multiplier on the IQR
and the tail quantile,
which is essentially how the data is divided up.
We can customize that range.
I'm just going to show you what that looks like real quickly.
I'm going to go into Analyze,
I'm going to go to Screening,
Explore Outliers.
I'm going to do this on my raw data.
I'm actually going to close this.
I'm going to go back to the raw data table.
I'll just pick a couple of these.
I'll actually pick the ones that I have in my slides: the peaked and the outliers.
I'm going to go ahead and use the quantile range outliers.
I'm going to adjust this to what the Box Plot uses: 0.25 and 1.5.
I'm going to click Rescan
and JMP's going to identify potential outliers here.
How does this connect to the distribution platform?
Well, if we go over here,
we look at this,
what we're going to see,
is there are a number of outliers here.
I'm actually going to select the rows,
I'm going to go over here.
Well, lo and behold, it's these values.
So you got 1, 2, 3, 4, 5, 6, 7.
There are seven outliers.
1, 2, 3, 4, 5, 6, 7.
That squares up.
That's exactly what we would expect.
Similarly, we've got 1, 2, 3, 4,
and if we scroll over here, under the outliers,
we see that there are four.
Great.
Going back to the slides here, we can customize this.
And that's actually what we get into in the subsequent question, Question 10.
How do we determine whether outliers are marginal or extreme?
Well, and why is it important?
Well, we can adjust the sensitivity
of the outlier detection based on the multiplier on the IQR
while keeping the tail quantile the same.
You might intuitively expect
that if you were to take Q3 plus a larger number times the IQR,
it's going to extend the whisker length, and similarly on the lower side.
That's going to mean that more points are going to fall inside.
So fewer outliers would be detected.
We should be able to see that and test that in JMP.
So if I were to increase this to two and click Rescan,
we see a few outliers become part of the Box Plot,
or we can imagine a situation where that's the case.
I'll increase this to three,
I'll hit Rescan,
and we see even fewer outliers being identified still.
As I go up to Q equal to five,
now I only have one outlier detected in the peaked column of data.
So the idea here is that we can develop criteria for Q. For example,
we might set it to three in a situation where data might be considered
a typographical error, where it might be an extreme or more extreme outlier.
We might set Q equal to 1.5 if, for example,
we think that the potential outlier might be associated with variability
due to the measurement system or special process variation.
We can do some simulation based on our application
and decide on what the value of Q should be in these particular scenarios.
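The effect of the multiplier Q can be sketched outside JMP as well. The Python below is an illustration only: the data values are hypothetical, `count_outliers` is a made-up helper rather than the Explore Outliers algorithm, and the tail quantile is held at 0.25 by using the quartiles.

```python
import statistics

def count_outliers(data, q):
    """Count points beyond Q1 - q*IQR or Q3 + q*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - q * iqr, q3 + q * iqr
    return sum(1 for x in data if x < low or x > high)

# Hypothetical peaked data with a few extreme values on both sides.
values = [-9.0, -3.5, -1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.1,
          0.2, 0.4, 0.6, 0.9, 1.3, 2.8, 4.0, 12.0]

# Larger multipliers widen the fences, so fewer points are flagged.
counts = {q: count_outliers(values, q) for q in (1.5, 2, 3, 5)}
for q, n_flagged in counts.items():
    print(f"Q = {q}: {n_flagged} potential outliers")
```

Running the same count over several candidate values of Q is one simple way to see how sensitive the outlier call is before settling on a criterion.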
In connection with that, in Question 10,
we touched a little bit on GRR or measurement system variability.
Question 8 goes a little bit deeper into this
and brings together some ideas.
The idea here is that we might actually consider
the distance between the upper fence and the first outlier,
or the first of a series of potential outliers.
We may extend that upper fence by a distance of two times the Sigma
due to the measurement system variability.
In this way, we're actually considering
the variability due to the measurement system.
And we're asking ourselves, is this potential value
within the noise of the measurement system or not?
We're creating a graphical way, a blended graphical means
of determining whether the value is reasonable
under the expectation that there's measurement system variability.
I have here the distance between the marginal outlier and the whisker
should be compared to the GRR noise standard deviation.
If it's within two standard deviations, we don't have 95% confidence
to conclude this marginal outlier is different from the whisker.
This is just a graphical version of a one-sample t-test in effect.
We could actually construct a one-sample t-test
using this red line as our target and the observed value,
or rather assumed series of values, this black dot,
as our distribution relative to that target.
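The comparison against measurement noise amounts to a simple distance check. In this Python sketch the function name, the fence value, and the GRR standard deviations are all hypothetical; the factor of two is the approximate 95% band described above.

```python
def within_measurement_noise(point, fence, sigma_grr, k=2.0):
    """True if the point lies within k GRR standard deviations of the fence."""
    return abs(point - fence) <= k * sigma_grr

# Hypothetical numbers: upper fence at 13.5, marginal outlier observed at 15.
print(within_measurement_noise(15.0, 13.5, sigma_grr=1.0))  # distance 1.5 <= 2.0
print(within_measurement_noise(15.0, 13.5, sigma_grr=0.5))  # distance 1.5 > 1.0
```

With the larger measurement sigma, the point cannot be distinguished from the whisker; with the smaller one, it stands out as a potential outlier.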
The next question, how many points do we need really to produce a Box Plot
if we're sample size limited?
Well, we might need at least seven points,
and our simulation in this particular sampling experiment shows that.
What's happening here?
Well, each of these three data sets has the same median.
You can see in this data set, there are six observations.
In this one there are seven, and in this one there are eight.
Let's start on the left, actually,
and we have eight observations, one out here around 15.
What if we reduce the number of observations to seven
and we actually included the same observation here,
but we reduce one of the others?
What if we reduce it further while maintaining the same median?
Then what we see is that this outlier, 15,
which is still in the data set, is no longer an outlier.
In essence, it becomes absorbed into the whisker itself.
The other thing that's interesting about this simple experiment,
is that the IQR becomes inflated when we go from seven to six.
We can see that visually, in that
the width of this box from Q1 to Q3 becomes much wider.
We can also see that numerically here.
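A similar effect can be reproduced with toy numbers. The three data sets below are hypothetical, not the ones on the slide; they share a median of 4.5 and all contain the value 15. The quartiles use Python's default `statistics.quantiles` method, so other quantile definitions may give different fences.

```python
import statistics

def iqr_and_upper_fence(data, k=1.5):
    """Return (IQR, upper fence) using Python's default quartile method."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return iqr, q3 + k * iqr

# Hypothetical data sets with 8, 7, and 6 observations and the same median.
datasets = {
    "eight": [1, 2, 3, 4, 5, 6, 7, 15],
    "seven": [1, 2, 3, 4.5, 6, 7, 15],
    "six": [1, 2, 3, 6, 7, 15],
}

results = {}
for name, data in datasets.items():
    iqr, fence = iqr_and_upper_fence(data)
    results[name] = (statistics.median(data), iqr, fence, 15 > fence)
    print(name, results[name])
```

In this toy case the IQR grows as points are removed, the upper fence moves out past 15 at six observations, and the point is absorbed into the whisker.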
I actually want to show you how we might explore that in JMP.
Here's some data.
It's not the same data, but here's some data.
I just created a column that ranks the data.
Again, I just used an instant column formula.
I can do that by selecting one of these options,
so I believe it's under Distributional.
Now, what I'm going to do is I'm going to go ahead and just plot this data.
I'm going to turn the histogram on its side.
I'm actually going to invoke the local data filter.
I'm going to bring in that rank column, but I'm going to make it ordinal first,
so that I can select data individually
rather than under the assumption of the continuous distribution.
I'm going to select everything.
Now, let's see, if I go back to the data table,
I know that 8 represents the highest, the largest value.
I'll keep 8 in there,
and I'll just start reducing some of the lower values
by holding down my control key and clicking that,
which will effectively remove that point dynamically from this analysis.
I got the control key down and clicked again, click again.
You saw it there, that one outlier at the low side,
anyway, in this case, just disappeared.
There's a relationship among the distance between the fences and the points,
which is calculated on the basis of the data
where the median and the quartiles are calculated
based on the data that's in the analysis.
This gives you a better means of appreciating how the Box Plot
is changing as a function of data that's either in or out of the analysis.
This is a really super cool feature
that I really like to use a lot in many contexts.
What's the advantage of a Robust Fit Outlier algorithm,
which is a JMP 16 algorithm?
It gives us another means of detecting outlyingness.
We have the ability to use a Cauchy method
which often avoids the impact of skewness,
which can be useful for practical situations.
We can also use a 3-sigma or a K-sigma multiplier
in order to help detect outlyingness.
All of these methods really help us
separate potential outliers from real outliers
and help us create a reasonable signal detection methodology
in a similar way that we might do if we were to use control charting
and build a control chart with limits
for our particular experimental or manufacturing application.
Question 13: can we include the sample size information in the Box Plot?
Well, this is where the Box Plot starts to present a clear limitation.
There isn't any sample size information explicitly in the Box Plot.
Although we do have the ability in Graph Builder
to create a notched Box Plot, which gives you something
like a confidence interval on the median,
the edges indicate a confidence interval on the median.
We also have the ability in Graph Builder to invoke the caption box
which is a very useful feature for summarization of data
graphically without needing to provide an additional tabular data output.
But of course, that information is completely hidden in the Box Plot itself.
Connected to that is, can we make any decision
with any level of statistical confidence if we're just looking at the Box Plot?
The answer is no.
In this particular example, we actually designed it
so the medians were slightly different on average.
And so we're getting some separation among the medians between the groups.
We used Fit Y by X in this context.
What this shows is that the mean [inaudible 00:29:27]
represents the mean, the mean diamonds are non-overlapping.
It looks like that's the case across all four groups being compared,
which indicates that there's some evidence
that there's a difference in the means between the groups.
We can also see the difference in the medians.
We can do a nonparametric test.
In this case, we're using a nonparametric Steel test with control,
where the control is just the Z normal.
We're seeing some evidence of separation,
statistical separation among the medians in this particular instance.
It's hard for us to detect that and see that in the Box Plot.
In fact, it really isn't that clear at all.
How can we tell if we have any concern with respect to Kurtosis?
What's Kurtosis?
Kurtosis is basically the idea that if you had a positive Kurtosis,
you would have data that's concentrated in the middle,
data that's squished together into the middle of the distribution.
That's this example on the right.
If you had an idealized case of extreme negative Kurtosis,
you'd have a uniform distribution where the data is really spread out.
What you can see in these graphs relative to the normal distribution,
is that the 50% dense zone, indicated by this red bar,
is basically about as long as the distance between Q1 and Q3 here,
but it's on one side of the median
in the uniform case.
It's about as long as this box width, and it's also on one side of the median.
That's a unique characteristic feature of this uniform distribution shape.
If we look at the peaked situation,
we see that the box width is much more compressed
and the shortest half width is also about the same as the box width,
and the shortest half, or rather the most dense region, is centered about the median.
That's similar to what you would see for the normal distribution case,
where the 50% dense region will be about centered
on the mean or the median and about the same width as the box.
Clearly, the differentiator here is that the distances are reduced quite a bit.
Really the takeaway here, though,
is that this type of interpretation is really difficult.
And it would be easier for us to rely on the shape that's evinced by the histogram
than to try to look at the Box Plots separately.
Question 16 is very similar to 15.
What about in the context of data that has more than one mode?
What about a bimodal distribution?
Well, I just took the Box Plots and pulled them out on the left,
they're from the pictures on the right.
We can't really see a whole lot of difference among these.
It's difficult for us to interpret this.
But once we put in the histograms, we can see clearly if we fit a two-peak distribution
that there's two modes in this data, and there's maybe one mode,
maybe a small mode, but really essentially one mode in the data on the left.
The Box Plot isn't particularly good
at detecting that presence of multiple modes.
The last question is, how many "normality violation failure modes"
can we detect with the Box Plot?
This question brings all the other ones together.
Well, if we have skewness,
we've shown that we have a strong ability to detect that.
If we have potential outliers,
we definitely have a strong ability to detect that.
If we have Kurtosis,
which is really related to the shape, as is the presence of multiple modes,
then we really don't have a strong ability to detect that.
If we're considering hypothesis testing,
we definitely don't have an ability to detect that either with the Box Plot.
What are the takeaways?
Well, the Box Plot is definitely a powerful visualization tool.
It's a great introductory tool,
and it has a wonderful ability to separate skewness
from potential outlyingness.
But it has its limitations.
In cases where we're looking at Kurtotic shape or a bimodality
or multimodality, the histogram is definitely a better choice.
That's really probably why JMP uses both the Histogram and the Box Plot together
in the distribution platform to visualize how the data is behaving,
if you will.
Of course, adding descriptive statistics
helps us really round out the picture where we have a graphical first approach.
This is just, again, a summarization of what we've discussed.
But in the last couple of minutes,
I just want to show you a couple more things about the data set itself.
Because I think this is perhaps the most useful aspect of the project.
How might we set up a data set like this?
All we really have to do to simulate data in JMP
is just create some rows and then create a function,
a random normal function.
The process that we did, one way you could do this is you could say,
okay, you can go into a column formula…
Let me just show you this.
You can just double-click into it, and you can click Formula,
you can edit the formula.
You can go over here to these random functions,
you can click it in, and then you can specify a population mean and sigma.
Zero and one by default, click OK, and then I can add a bunch of rows.
I'll go ahead and add 100 rows.
What about these other distributions?
For a uniform distribution, we can use the random uniform function.
And then we can specify a Min and Max value.
In this case,
I just specified the minimum of this column,
this normally distributed data column, and the Max is the maximum.
And then finally, as I mentioned,
I standardized the column so that it was on the same numeric scale.
This standardize
feature is common to all of these columns.
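For readers without JMP at hand, the same setup can be sketched in Python. This is an assumption-laden stand-in for the column formulas shown in the demo: the seed, the variable names, and the `standardize` helper are hypothetical, and only the normal and uniform columns are reproduced.

```python
import random
import statistics

def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]

rng = random.Random(2022)  # arbitrary seed so the sketch is repeatable

# 100 rows drawn from a normal population with mean 0 and sigma 1.
normal = [rng.gauss(0, 1) for _ in range(100)]

# Uniform data spanning the minimum and maximum of the normal column.
uniform = [rng.uniform(min(normal), max(normal)) for _ in range(100)]

# Put both columns on the same numeric scale before comparing them.
z_normal = standardize(normal)
z_uniform = standardize(uniform)
print(round(statistics.fmean(z_uniform), 6), round(statistics.stdev(z_uniform), 6))
```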
Now, the last thing I want to talk about real quick
is, well, what about peaked?
What about right skewed, and even bimodal?
Well, one of the things we can do, which I really think is cool,
is we can use the distribution calculator in JMP
to help us understand what certain distribution types look like.
I'm just going to go into it here. I'm going to just drive down in here.
I'll share with you the location here of this script.
It's going to be under Calculator. It's not.
It's going to be under Distribution.
Generator.
Distribution calculator, on the calculators, yes.
How might I create a distribution that's right skewed?
Which random function would I use?
Well, I have the ability to look at some of these distributions and see
for example,
if I specify a random F and I specify these parameters,
then I'm getting a distribution with this kind of skewness.
And then I can say, well,
what happens if I change these parameters a little bit?
How is that going to change the distribution?
I can use this insight to specify the parameters
for the random distributions that I specify in my data set.
In fact, that's what I did here.
What did I do for the peaked one?
Well, if I look at the T distribution, and I reduce the degrees of freedom,
I'm going to get a distribution that's relatively peaked.
I'm going to see a positive Kurtosis in that.
That's one way I can understand the shape of these distributions
so that I can use them to my advantage
to do different what-if analyses in JMP.
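The link between the t distribution's degrees of freedom and peakedness can also be written down directly. The sketch below uses the standard result that the excess kurtosis of a t distribution is 6 / (df - 4) for df greater than 4. The function names are hypothetical, and the sampler builds a t variate from a normal and a chi-square draw rather than calling any JMP function.

```python
import math
import random

def t_excess_kurtosis(df):
    """Theoretical excess kurtosis of a t distribution (defined for df > 4)."""
    if df <= 4:
        raise ValueError("excess kurtosis is finite only for df > 4")
    return 6 / (df - 4)

def random_t(df, rng):
    """Draw one t variate as Z / sqrt(chi-square / df)."""
    z = rng.gauss(0, 1)
    chi_square = rng.gammavariate(df / 2, 2)  # gamma(shape=df/2, scale=2)
    return z / math.sqrt(chi_square / df)

# Fewer degrees of freedom means heavier tails and a sharper peak.
for df in (5, 10, 30):
    print(df, t_excess_kurtosis(df))

rng = random.Random(16)  # arbitrary seed
peaked = [random_t(5, rng) for _ in range(100)]
```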
I'm just going to quickly go back to my slides.
Thank you very much for listening.
If you have any questions,
I look forward to receiving them on the user community.
As I mentioned, this project will be posted there,
and the summary abstract is posted at this link here.
Thank you again.